| |
Regular expression search engine for text in UCS2 form, taking surrogates into
account. This implementation is an improved translation of the URE package
written by Mark Leisher, who used a variation of the RE->DFA algorithm by
Mark Hopkins.
Assumptions
- Regular expression and text are already normalized.
- Conversion to lower case assumes a 1-1 mapping.
Definitions
Separator - any one of U+2028, U+2029, NL, CR.
Operators
- . - match any character
- * - match zero or more of the preceding subexpression
- + - match one or more of the preceding subexpression
- ? - match zero or one of the preceding subexpression
- () - sub expression grouping
- {m, n} - match at least m occurrences and up to n occurrences
Note: both values can be 0 or omitted. An omitted m counts as 0, and an n that
is 0 or omitted denotes an unlimited upper bound:
  {,}, {0,} and {0, 0} correspond to *
  {, 1} and {0, 1} correspond to ?
  {1,} and {1, 0} correspond to +
- {m} - match exactly m occurrences
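The bound rules for {m, n} can be illustrated with a small sketch. It is written in Python for illustration only and is not the engine's own code; the function names normalize_bounds and equivalent_operator are hypothetical. It assumes that an omitted or zero m means a lower bound of 0 and that an omitted or zero n means no upper limit, as the note above describes.

```python
def normalize_bounds(m=None, n=None):
    """Return (lower, upper) for a {m, n} quantifier.

    upper is None when the repetition has no upper limit.
    Hypothetical helper, not part of the documented engine.
    """
    lower = m or 0      # omitted (None) or 0 both mean "zero or more"
    upper = n or None   # omitted (None) or 0 both mean "unlimited"
    return lower, upper


def equivalent_operator(m=None, n=None):
    """Return '*', '?' or '+' if {m, n} matches one of them, else None."""
    shorthand = {(0, None): "*", (0, 1): "?", (1, None): "+"}
    return shorthand.get(normalize_bounds(m, n))


print(equivalent_operator())       # {,}
print(equivalent_operator(0, 0))   # {0, 0}
print(equivalent_operator(None, 1))  # {, 1}
print(equivalent_operator(1, 0))   # {1, 0}
```

Any other combination, such as {2, 5}, has no single-operator equivalent, so the sketch returns None for it.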
Notes
The "." operator normally does not match separators, but a flag is available
that will allow this operator to match a separator.
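As a rough illustration of the separator rule, the following Python sketch decides whether "." accepts a given character. The parameter name match_separators is an assumption standing in for the flag mentioned above, whose real name is not given here; NL and CR are taken to be U+000A and U+000D.

```python
# Separators as defined under "Definitions": U+2028, U+2029, NL, CR.
SEPARATORS = {"\u2028", "\u2029", "\n", "\r"}


def dot_matches(ch, match_separators=False):
    """Return True if the "." operator would accept the character ch.

    match_separators is a hypothetical stand-in for the engine flag
    that lets "." match a separator.
    """
    if ch in SEPARATORS:
        return match_separators
    return True
```

With the default setting, dot_matches("\n") is False, while dot_matches("\n", match_separators=True) is True.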
Literals and Constants
- c - literal UCS2 character
- \x.... - hexadecimal number of up to 4 digits
- \X.... - hexadecimal number of up to 4 digits
- \u.... - hexadecimal number of up to 4 digits
- \U.... - hexadecimal number of up to 4 digits
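One way to read such a constant is sketched below in Python. This is an illustrative sketch, not the engine's parser: the function name parse_hex_escape is hypothetical, and raising an error when no hexadecimal digit follows the escape letter is an assumption, since the behaviour for that case is not described here.

```python
HEX_DIGITS = set("0123456789abcdefABCDEF")


def parse_hex_escape(pattern, pos):
    """Parse \\x, \\X, \\u or \\U followed by 1 to 4 hexadecimal digits.

    pos is the index of the backslash. Returns (code_point, next_pos),
    where next_pos is the index of the first character after the constant.
    """
    if (len(pattern) < pos + 2 or pattern[pos] != "\\"
            or pattern[pos + 1] not in "xXuU"):
        raise ValueError("not a hexadecimal constant")
    start = pos + 2
    end = start
    # Consume at most four hexadecimal digits.
    while end < len(pattern) and end - start < 4 and pattern[end] in HEX_DIGITS:
        end += 1
    if end == start:
        # Assumption: a constant without digits is treated as an error.
        raise ValueError("expected at least one hexadecimal digit")
    return int(pattern[start:end], 16), end
```

For the pattern text \U10A\p0, the sketch stops at the second backslash and yields the value 0x10A. A fifth digit, as in \x12345, is left in the pattern as an ordinary literal.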
Character classes
- [...] - Character class
- [^...] - Negated character class
- \pN1,N2,...,Nn - Character properties class
- \PN1,N2,...,Nn - Negated character properties class
POSIX character classes recognized: :alnum: :alpha: :cntrl: :digit: :graph:
:lower: :print: :punct: :space: :upper: :xdigit:
Notes
- Character property classes are \p or \P followed by a comma-separated list
of integers between 0 and the maximum entry index in TCharacterCategory. These
integers correspond directly to the TCharacterCategory enumeration entries.
Note: the upper, lower and title case classes need case-sensitive search to
be enabled to match correctly!
- Character classes can contain literals, constants and character property
classes. Example:
[abc\U10A\p0,13,4]
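The property part of this example, \p0,13,4, could be read roughly as in the Python sketch below. It is illustrative only: the function name parse_property_class is hypothetical, and max_index is a parameter because the maximum entry index of TCharacterCategory is not stated here. Rejecting an empty list entry or an out-of-range index with an error is likewise an assumption.

```python
def parse_property_class(pattern, pos, max_index):
    """Parse \\p or \\P followed by a comma-separated list of integers.

    pos is the index of the backslash. Returns (negated, categories,
    next_pos). max_index stands in for the highest TCharacterCategory
    entry index, which the caller has to supply.
    """
    if (len(pattern) < pos + 2 or pattern[pos] != "\\"
            or pattern[pos + 1] not in "pP"):
        raise ValueError("not a character property class")
    negated = pattern[pos + 1] == "P"
    i = pos + 2
    categories = []
    while True:
        start = i
        while i < len(pattern) and pattern[i] in "0123456789":
            i += 1
        if i == start:
            raise ValueError("expected a category index")
        value = int(pattern[start:i])
        if value > max_index:
            raise ValueError("category index out of range")
        categories.append(value)
        # A comma continues the list; anything else ends it.
        if i < len(pattern) and pattern[i] == ",":
            i += 1
        else:
            break
    return negated, categories, i
```

Called on the example class at the position of \p, the sketch returns the category indices 0, 13 and 4 and stops at the closing bracket.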
|